perm filename HYRS.RFS[UP,DOC] blob sn#002762 filedate 1972-10-06 generic text, type T, neo UTF8
PRODUCTION INTERPRETER AND SCANNER.

This write-up describes briefly a group of programs which can be used
to  compile  production  tables  from  a  reasonably  clear  list  of
productions, and then play with various input sentences.  This is the
same  production  compiler  and  interpreter that is used by the SAIL
compiler.  We make brief mention in the writeup  of  the  method  for
attaching semantic routines to your syntax analyzer.


I. Production Compiler

The production compiler expects an  input  file  of  a  very  special
format.  The ultimate aim of this input file is to specify a sequence
of productions, but we must first specify  the  production  alphabet,
both terminal and non-terminal.

The  meta  language for specifying productions consumes a few symbols
and conventions.  First, all alphabetic characters in the input  file
must be in upper case.  The only delimiters are space and tab, so →AG
does NOT get interpreted as two separate symbols.


1.  	For  various  undisclosed  reasons,  we  must  first  provide
alternate "names" for all single-letter delimiters we may need,  such
as  (  ) } ↑ [ ], etc.  The pseudo-op <SYMBOLS> is given, followed by
pairs of

	single-letter-delimiter crazy-alternate-name

See the example below for some  instances  of  this  phenomenon.  The
crazier  the  name  the  better.   I suggest that each of these names
begin with the character $, to avoid duplication elsewhere.


2. 	The terminal symbols of the language are then specfied in two
groups:  first, those which the scanner  knows  about  directly,  and
second,  the group of reserved words in the language (these look like
identifiers to the  scanner,  but  you  obviously  desire  a  special
interpretation).  The  first  group  is  initiated with the pseudo-op
<TERMINALS>, and must include all the single-letter-delimiters  cited
in  1,  plus  three  other  symbols  WHICH  MUST ALWAYS BE EXACTLY AS
FOLLOWS:

	IDENT	NUMBR	STRIN

These are the "names" that the scanner will return for (respectively)
identifiers   that  are  not  reserved  words,  numbers,  and  string
constants. The second group is started with <RESERVED-WORDS>, and  is
merely a list of all the reserved words you desire.

3.	The non-terminal symbols are then declared.  These are things
which you may want to put on the  stack  as  "markers"  of  partially
completed   reductions.    The  pseudo-op  <NON-TERMINAL-SYMBOLS>  is
followed by a list of such symbols.

4.	For efficiency reasons, it is helpful to  define  CLASSES  of
symbols.  Then with one production, we can determine if a whole class
of productions are applicable, and we can often avoid stating all the
productions.   To  specify  classes,  we  start  with  the  pseudo-op
<CLASSES>, and then, on a one-class-to-a-line basis, we specif%y:

	class-name	class-element class-element ....

where class-name must begin with a @. All of the class-elements  must
have  already  been  defined  in  2  or 3 above, and hence may not be
classe-names.

5.	Finally we are ready to state the productions.  We  give  the
pseudo-op <PRODUCTIONS> to start this off.  The very first production
listed will be the starting point  for  the  production  interpreter.
The  interpreter  will  start  AFTER ONE SCAN, so that the top of the
stack will already have the first "parse token" on it.

The syntax of each production is:

label:	LHS →→ RHS EXEC xxx SCAN α ¬yyy #zzz ↓↓ ↑www

This specifies a  production.   These  symbols  need  at  least  some
explanation:
-- the label is the "name" of the production.  It  is  optional.  All
        labels  in  your  program must begin with the character % and
        must be unique in their first 6 characters.  There may not be
        a space between the last character of the label and the :  .

-- LHS is a left-hand-side-list.  That is, a list of symbols declared
        in 2,3, or 4 above which is to be matched against the current
        top elements of the stack.

--  RHS is a right-hand-side-list.  This is the list of symbols which
        is to  replace  the  LHS  on  the  stack  if  the  production
        succeeds.

--  EXEC  specifies  that a list of execs is to be called.  The names
        that you give in this list of exec  routines  should  be  the
        names of the procedures that you make up in your exec routine
        file.

-- SCAN specifies that we are to scan before going  on  to  the  next
        production.

--  the  # ¬ ↑ and ↓↓ parts specify where "control" of the production
        interpreter is to go.

The following things are omittable:

-- the label

-- the →→ and right-hand-side-list, in which case it is assumed  that
        if  the  production  succeeds,  the  stack is restored to its
        original condition.

-- the right-hand-side-list, in which case nothing is restored to the
        stack.

-- EXEC and the list of exec routines which follows it.

--  SCAN  if  you do not want to scan.  Note that SCAN α means scan α
        times before going to the success spot.

-- the #zzz may be omitted.

-- the ↓↓ may be omitted.

__ the ↑www may be omitted. 

Now for the interpretation.  When the interpreter is pointed at this
production, the stack is compared  against  the  left-hand-side-list.
The  last element in this list is compared against the current top of
the stack, and so on back the list and up the  stack.   Compares  are
equal if:

-- the symbols actually match.

-- the list element in the left-hand-side is  SG,  which  stands  for
        sigma,  and  compares  equal  to  anything. SG is thus a meta
        symbol of the production compiler, and may not be used in any
        other way.

--  the  list element in the left-hand-side is a class symbol and the
        corresponding stack element belongs to  that  class.  If  the
        left-hand-side-list  fails  to match the top positions of the
        stack, the production interpreter sees a failure and goes  to
        another  production.  The next production is normally the one
        following the one which failed.  If you want to specify  some
        other  failure  production, the #zzz construct is used, where
        zzz is the label of the production you want to go to.

If the left-hand-side-list matches the top of the stack, then:

1. These stack elements are popped off the stack temporarily.

2. If any exec routines have been specified, these  are  called.  The
        statement  EXEC FOOBAZ causes the subroutine called FOOBAZ to
        be  called.   The  statement  EXEC  @DCL  FOOBAZ  causes  the
        following  to happen: FOOBAZ is a routine which wants to know
        some  class  information  about  one  of  the  left-hand-side
        elements.   This  element  is specified by the @DCL (this was
        the same thing that occurred on the left-hand-side). Just how
        this  class information is passed is not important here. Upon
        return from all exec routines, we continue to:

3. The stack is  then  fixed  back  up,  reflecting  any  changes  as
        specified in the right-hand-side-list.

4. The scanner is called if SCAN was specified.

5.  The  production has now "succeeded".  We must cast around for the
        next   production.    Each   production   must   have    some
        specification  of  what  to  do  on  success.   If  you  only
        specified a ¬yyy , then we do a "jump" to the production with
        label yyy. If you specified a ↑aaa and a ¬yyy , we do a "push
        jump"  to  production  aaa.   When  we   return   from   that
        "subroutine,"  we  will go to production yyy.  The "pop jump"
        is specified by ↓↓.  It makes no sense to say: ¬aaa ↓↓  since
        these  two  operations  conflict  (in fact, the pop jump will
        take precedence).

The list of productions must be followed by the pseudo-op <END>



II.  Compiling your productions and execs.

If  you  have  completed  your  file of productions as above, you are
ready to give it a whirl.  Suppose you named that file MYPROD.   Then
make another file called, say PRODGO, which is a text file that looks
exactly like this:

QQ.=MYPROD/NOLIST/NOLO/NON SYS:HYRS
QQ=HEAD[S,AIL]+PARSER[1,RFS]+QQ/FORWARD,/SAIL SCANR[1,RFS],EXECS


This will be a command file which starts the whole  mess  in  motion.
You  need  a  file  containing exec routines stated in the production
list. If you merely want to play, and just want dummy exec  routines,
allow me to suggest the following glorious SAIL macro:

DEFINE exec(x)="internal procedure x;
	out(1,"" x ""&('15&'12));";

The scanner has the TTY inited on channel 1, so this will  cause  the
name  of  the exec routine to be printed out.  The macro is used very
simply. If you asked for routines called RETRY,ERRG,STORE, just say

	exec (RETRY)
	exec (ERRG)
	exec (STORE)

These definitions of exec routines should be in a file  called  EXECS
(or  you  may  change  the name in the command file, see above). This
file has an ENTRY statement at the beginning to avoid putting  out  a
starting  address.   The scanner is the guy who is "started," not the
exec routine package.  There is a sample  exec  routine  file  in  VI
below.

Now  you  are  ready to try things.  To get everything to compile and
load, type to the time-sharing system EXECUTE @PRODGO.   This  causes
the command file to be scanned and the world to whirr.



III.  Scanner and debugger.

The scanner will ask you some stupid questions,  like  "filename  for
scan  table".   If you used the command file format cited above, type
QQ followed by a carriage return.  The scanner  is  now  loading  the
scan  table  and  the  reserved word symbol table.  Then you will get
"input device" spewed out. It is asking where to get its input  from.
Tell  it,  with TTY or DSK or something.  If it requires a file name,
it will not hesitate to ask.  If you give it a bum filename, it  will
not hesitate to ask again.

Then the production interpreter is initialized, the scanner is called
once, and then the first production is interpreted.  But to allow you
to watch the progress of the production analysis and stack twiddling,
we print out the production name (the label you gave the production),
and the top 3 entries on the stack.  If there was no production name,
we take the nearest one (say RP0) and start making RP0A, RP0B, etc.

Then the onus is on you the user to tell the interpreter what  is  to
happen next.  This is done by typing something at the interpreter:

P	--proceed.   Go  on  with what you were doing.  When in doubt
        about the fancy features below, just keep typing P.

B	--set or remove a breakpoint on  a  production.  Follow  this
        immediately  by  S  to  set a breakpoint, or R to remove one,
        followed by the production name, followed by  a  space.   The
        production   interpreter   will   stop   whenever  it  begins
        consideration of a production which has a breakpoint on it.

#M	--set mode.  The number preceeding the M should be:

	1 -- print just exec routine calls.

	2 -- print nothing

	3 -- print everything (exec routines and prodcutions)

	4 -- multiple proceed.  If you type 4MP, the thing will  just
        scream along merrily showing you whatever you have asked for.

	5 -- display only the line being scanned.

	6 -- super-fast mode.  Ignore the user altogether.

D	--go to DDT or RAID.

#S	--look  at stack element of number #.  0 is the top of stack,
        and positive numbers go back up the stack.

These  features  should  allow  you  to  monitor the progress of your
syntax analyzer, and see where you may need to change productions.


IV.  Attaching semantic routines to your productions.

This section is not intended for those people who just want to  play.
It  will  get rather involved with SAIL constructs, and is not at all
necessary for just playing around with syntax. Now that only the most
determined  people  are still reading, allow me to say that it is not
really very complicated.

If the production interpreter finds a successful production, it  pops
off the left-hand-side elements of the stack into some temporaries in
core.  It then makes up a copy,  in  core  temporaries,  of  how  the
right-hand-side  will look when it is finally restored. There are two
stacks that are manipulated in all these operations, the parse  stack
and  the  semantic  stack.   The operations on the parse and semantic
stacks are always identical.

Consider the following production:

	ELHS E SG →→ ELHS T SG

When  the  interpreter  pops  off  the  left-hand-side  elements,  it
associates with each a number (distance from top of stack):

	SG	0
	E	1
	ELHS	2

These  numbers  are  the  references  to  the  semantic stack entries
associated with the parse tokens stated.  As the  temporary  copy  of
the right-hand-side is created in core, there are several rules:

-- if we are putting on a brand new parse token at position α  (α  is
        the  distance  from  the new top of stack), then the semantic
        element from the αth location on the left-hand-side is copied
        for the α location on the right-hand-side.

--  if  we  are  putting  on  a  parse  token  which  occurs  on  the
        left-hand-side at location  β,  then  we  copy  location  β's
        semantic entry to the right-hand-side location α. If there is
        more  than  one  occurrence  of  the  parse  token   on   the
        left-hand-side, we assume the one with the highest β.

So this information is sufficient to specify how the semantic entries
come  and  go.   The purpose of the first rule above is to facilitate
such things as

	T SG →→ E SG

promoting a term to an expression, and saving the  semantics  of  the
term to associate with the expression.

Now  for  the  routines  which access the semantic stack entries. The
routine  GETSEM  (integer  β)  returns  the  integer  entry  in   the
left-hand-side entry β.  The routine PUTSEM (integer α; integer word)
deposits the semantics "word" into right-hand-side  location  α.  The
SAIL  declarations for these routines should be made in your program,
and should look like:

	EXTERNAL INTEGER PROCEDURE GETSEM(INTEGER β);
	EXTERNAL PROCEDURE PUTSEM(INTEGER α,word);

When  the  scanner  scans  an  identifier, it records in the semantic
stack  entry  a zero.  The  string  for  the  most  recently  scanned
identifier or string constant (same format as SAIL string constants )
is in  the  string  IDNAME (which you  must have declared  EXTERNAL).
The string for the  next most recently  scanned  identifier is in the
string OLDID.  If the  scanner saw a number, the decimal value of the
number is in SCNDEC, and the octal in SCNOCT.  If there was an 8 or 9
in the input stream, SCNOCT is negative.

There  is  a  facility  for  passing  class-type  information to exec
routines. The production interpreter stores a number in GENCLS (which
you  must  declare  as  an  external integer) before calling the exec
routine. This number contains the type  information  you  need.   The
problem   of   interpreting   that  number  is  complicated  but  not
impossible. The elementary statement is:  the  number  is  the  index
(starting  at 0) into the line of symbols in which that class type is
declared. So if you declare the class type

	@DCL INTEGER REAL BOOLEAN COMPLEX STRING

and if you wrote a production

%CON:	@DCL IDENT , →→ @DCL	EXEC ENTID SCAN 2	¬%CON

the when ENTID is called, GENCLS will contain the following numbers:

	parse token which compared
	equal to @DCL
	INTEGER					0
	REAL					1
	BOOLEAN					2
	COMPLEX					3
	STRING					4

This simple explanation breaks down when, say BOOLEAN was declared to
be a member of another class, and THAT CLASS WAS DECLARED BEFORE @DCL
WAS.  So the easy rule is: declare the classes that will be  required
to  provide  class  information  to semantic routines first, and then
those classes which are used only for parsing purposes.




V. Sample Production File:


<SYMBOLS>

; $SEMI - $MINUS + $PLUS * $TIMS / $DIV % $MOD ← $LARW
∧ $AND ⊗ $SHIFT : $COLW
( $LPRN ) $RPRN , $COMA

<TERMINALS>
 : ( ) , ; - + * / % ← ∧ ⊗ IDENT NUMBR STRIN


<RESERVED-WORDS>

IF THEN ELSE BEGIN END GO LABEL INTEGER REAL

<NON-TERMINAL-SYMBOLS>

BLAT S T E P LHS SIFC EIFC

<CLASSES>
@PM	+ - ⊗ ∧
@TD	* /
@DCL	INTEGER REAL LABEL

<PRODUCTIONS>

MUMBLE  IS THE WORD YOU USE TO MAKE COMMENTS IN THE PRODUCTIONS.
	FOR OBVIOUS REASONS, WE CANNOT USE "COMMENT". THE
	END OF THE COMMENT IS SIGNALLED BY A SEMICOLON ;

%PR1:	BEGIN →→ BLAT BEGIN	EXEC BEG1 SCAN 2	¬%DS #%ERT
%DS:	@DCL SG →→ SG		EXEC @DCL DECL SCAN	¬%IDL #%S1

%IDL:	IDENT , →→		EXEC ENTID SCAN 2	¬%IDL
	IDENT ; →→		EXEC ENTID ENDD SCAN 2	¬%DS


%S1:	BEGIN SG		SCAN			¬%DS
	IDENT ← →→ LHS		EXEC TWID SCAN 2	¬%EX1
	IF SG →→ SIFC SG	EXEC IFBEG SCAN		¬%EX1
	GO IDENT →→ S		EXEC TRA SCAN		¬%S9
	IDENT : →→		EXEC ENTLAB SCAN 2	¬%S1
%ERT:	SG →→			EXEC ERRTR SCAN		¬%S1

MUMBLE THIS LAST IS THE ERROR MAN.  HE MIGHT THROW IN THE TOWEL ;

%EX1:	( SG			SCAN			¬%EX1
	IDENT SG →→ P SG				¬%P2
	NUMBR SG →→ P SG				¬%P2 #%ERT

%P2:	T @TD P SG →→ T SG	EXEC @TD TIMDIV		¬%T2
	P SG →→ T SG					¬%T2
%T2:	T @TD			SCAN 2			¬%EX1
	E @PM T SG →→ E SG	EXEC @PM PLUSM		¬%E2
	T SG →→ E SG					¬%E2
%E2:	E @PM			SCAN 2			¬%EX1
	( E ) →→ P		SCAN			¬%P2
	SIFC E THEN →→ SIFC	EXEC BOOL SCAN 2	¬%S1
	LHS E SG →→ S SG	EXEC STORE		¬%S9

%S9:	BEGIN S ; →→ BEGIN	SCAN 2			¬%S1
	BEGIN S END →→ S	SCAN			¬%S9
	SIFC S ELSE		EXEC IF1 SCAN 2		¬%S1
	SIFC S SG →→ S SG	EXEC IF2		¬%S9
	SIFC S ELSE S SG →→ S SG EXEC IF3		¬%S9
	BLAT S ; →→		EXEC DONES		¬%S1


<END>










VI. Sample exec routine file:



ENTRY DONES;

BEGIN "EXECS"

DEFINE EXEC(X)="INTERNAL PROCEDURE X;
	OUT(1,"" X ""&('15&'12));";


EXEC	(BEG1)
EXEC	(DECL)
EXEC	(ENTID)
EXEC	(ENDD)
EXEC	(TWID)
EXEC	(IFBEG)
EXEC	(TRA)
EXEC	(ENTLAB)
EXEC	(ERRTR)
EXEC	(TIMDIV)
EXEC	(PLUSM)
EXEC	(BOOL)
EXEC	(STORE)
EXEC	(IF1)
EXEC	(IF2)
EXEC	(IF3)
EXEC	(DONES)


END